layout generation
- Asia > China > Shaanxi Province > Xi'an (0.04)
- North America > United States (0.04)
- Asia > Middle East > Israel (0.04)
HiCo: Hierarchical Controllable Diffusion Model for Layout-to-image Generation
The task of layout-to-image generation involves synthesizing images based on the captions of objects and their spatial positions. Existing methods still struggle with complex layout generation, where common failure cases include missing objects, inconsistent lighting, conflicting view angles, etc. To effectively address these issues, we propose a Hierarchical Controllable (HiCo) diffusion model for layout-to-image generation, featuring an object-separable conditioning branch structure. Our key insight is to achieve spatial disentanglement through hierarchical modeling of layouts. We use a multi-branch structure to represent the hierarchy and aggregate the branches in a fusion module. To evaluate the performance of multi-objective controllable layout generation in natural scenes, we introduce the HiCo-7K benchmark, derived from the GRIT-20M dataset and manually cleaned.
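A minimal, illustrative sketch of the object-separable branch plus fusion idea described in the HiCo abstract above; this is not the HiCo implementation, and all module names, shapes, and the masked-sum fusion are assumptions for demonstration only.

```python
# Sketch: one conditioning branch per layout object, aggregated by a fusion module.
import torch
import torch.nn as nn

class BranchEncoder(nn.Module):
    """Encodes one (caption embedding, box mask) pair into a feature map."""
    def __init__(self, text_dim=32, feat_dim=16):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)
        self.conv = nn.Conv2d(feat_dim + 1, feat_dim, kernel_size=3, padding=1)

    def forward(self, text_emb, box_mask):
        # text_emb: (B, text_dim); box_mask: (B, 1, H, W)
        B, _, H, W = box_mask.shape
        t = self.text_proj(text_emb)[:, :, None, None].expand(B, -1, H, W)
        return self.conv(torch.cat([t, box_mask], dim=1))

class FusionModule(nn.Module):
    """Aggregates per-object branch features into one conditioning map."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.mix = nn.Conv2d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, branch_feats):
        # branch_feats: list of (B, feat_dim, H, W); masked sum keeps objects disentangled
        return self.mix(torch.stack(branch_feats, dim=0).sum(dim=0))

# Toy usage: two objects on a 3-sample batch over an 8x8 latent grid.
enc, fuse = BranchEncoder(), FusionModule()
feats = [enc(torch.randn(3, 32), torch.rand(3, 1, 8, 8).round()) for _ in range(2)]
cond = fuse(feats)  # (3, 16, 8, 8) conditioning signal for a diffusion UNet
print(cond.shape)
```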
Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task.
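A minimal pipeline sketch of the step-by-step decomposition described in the VPGen abstract above. The prompt formats and the `call_layout_lm` / `render_image` helpers are hypothetical stand-ins; the paper's actual model and interfaces may differ.

```python
def call_layout_lm(prompt: str) -> str:
    """Placeholder for the finetuned layout LM; returns a canned response here."""
    return ("dog: 2; frisbee: 1 | dog [0.1,0.5,0.4,0.9]; "
            "dog [0.5,0.5,0.8,0.9]; frisbee [0.4,0.2,0.6,0.4]")

def generate_objects_and_counts(caption: str) -> str:
    # Step 1: ask the LM which objects appear and how many of each.
    return call_layout_lm(f"Caption: {caption}\nStep 1 - objects and counts:").split("|")[0].strip()

def generate_layout(caption: str, objects: str) -> list[tuple[str, list[float]]]:
    # Step 2: ask the LM for a bounding box per object instance.
    raw = call_layout_lm(f"Caption: {caption}\nObjects: {objects}\nStep 2 - boxes:").split("|")[1]
    boxes = []
    for item in raw.split(";"):
        name, coords = item.strip().split(" [")
        boxes.append((name, [float(v) for v in coords.rstrip("]").split(",")]))
    return boxes

def render_image(caption: str, layout):
    """Step 3: hand the caption plus boxes to a layout-to-image generator (stubbed)."""
    return f"<image for '{caption}' with {len(layout)} boxes>"

caption = "two dogs chasing a frisbee"
objects = generate_objects_and_counts(caption)
layout = generate_layout(caption, objects)
print(objects, layout, render_image(caption, layout), sep="\n")
```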
Tokenizing Buildings: A Transformer for Layout Synthesis
de Guevara, Manuel Ladron, Rhee, Jinmo, Bidgoli, Ardavan, Razgaitis, Vaidas, Bergin, Michael
We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities--referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts--with fewer collisions and boundary violations and improved navigability.
- North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)
- North America > United States (0.04)
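An illustrative sketch of the "unified embedding" idea from the SBM abstract above: categorical room attributes get learned embeddings, a grouped continuous feature vector gets a linear projection, and the pieces are summed into one token per room entity. Feature names and dimensions are assumptions, not the paper's actual schema.

```python
import torch
import torch.nn as nn

class RoomTokenizer(nn.Module):
    def __init__(self, n_room_types=20, n_levels=8, cont_dim=6, d_model=64):
        super().__init__()
        self.type_emb = nn.Embedding(n_room_types, d_model)   # categorical: room type
        self.level_emb = nn.Embedding(n_levels, d_model)       # categorical: floor level
        self.cont_proj = nn.Linear(cont_dim, d_model)          # continuous group: area, bounds, ...

    def forward(self, room_type, level, cont_feats):
        # room_type, level: (B, N) int tensors; cont_feats: (B, N, cont_dim)
        return self.type_emb(room_type) + self.level_emb(level) + self.cont_proj(cont_feats)

# Toy usage: 2 buildings with 5 room entities each become a (2, 5, 64) token sequence,
# ready for a Transformer encoder (e.g. nn.TransformerEncoder).
tok = RoomTokenizer()
tokens = tok(torch.randint(0, 20, (2, 5)), torch.randint(0, 8, (2, 5)), torch.randn(2, 5, 6))
print(tokens.shape)
```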
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Zhang, Yiheng, Cai, Zhuojiang, Wang, Mingdao, Guo, Meitong, Li, Tianxiao, Lin, Li, Wang, Yuwang
In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 21,367 layouts and over 433k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using both a text-conditioned diffusion model and a text-conditioned autoregressive model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset, which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis. The dataset and code will be made public upon acceptance.
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Europe > Switzerland (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
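A hypothetical record sketch of how one M3DLayout sample could be represented in code, based only on the abstract above (layout objects plus three levels of structured text). All field names and the example values are assumptions; the released dataset may use a different schema.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectInstance:
    category: str                          # e.g. "sofa"
    position: tuple[float, float, float]   # object center in scene coordinates
    size: tuple[float, float, float]       # bounding-box extents
    yaw: float                             # rotation about the vertical axis, in radians

@dataclass
class LayoutSample:
    scene_summary: str                     # global description of the room
    large_furniture_text: str              # relational placement of large furniture
    small_item_text: str                   # fine-grained arrangement of smaller items
    objects: list[ObjectInstance] = field(default_factory=list)

sample = LayoutSample(
    scene_summary="A compact living room with a seating area facing a TV wall.",
    large_furniture_text="The sofa faces the TV stand across a rug.",
    small_item_text="A lamp sits on the side table left of the sofa.",
    objects=[ObjectInstance("sofa", (1.0, 0.0, 2.0), (2.0, 0.9, 0.9), 3.14)],
)
print(len(sample.objects), sample.scene_summary)
```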
APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design
Chen, Xinpeng, Han, Xiaofeng, Zhang, Kaihao, Ren, Guochao, Wang, Yujie, Cao, Wenhao, Zhou, Yang, Lu, Jianfeng, Song, Zhenbo
Layout design is a crucial step in developing mobile app pages. However, crafting satisfactory designs is time-intensive for designers: they need to consider which controls and content to present on the page, and then repeatedly adjust their size, position, and style for better aesthetics and structure. Although many design tools can now help perform these repetitive tasks, extensive training is needed to use them effectively. Moreover, collaborative design across app pages demands extra time to align standards and ensure consistent styling. In this work, we propose APD-agents, a large language model (LLM)-driven multi-agent framework for automated page design in mobile applications. Our framework contains OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent. Upon receiving the user's description of the page, the OrchestratorAgent dynamically directs the other agents to accomplish the user's design task. Specifically, the SemanticParserAgent is responsible for converting the user's description of page content into structured data. The PrimaryLayoutAgent generates an initial coarse-grained layout of the page. The TemplateRetrievalAgent fetches semantically relevant few-shot examples to enhance the quality of layout generation. Finally, for each element in the layout, the RecursiveComponentAgent decides how to recursively generate all the fine-grained sub-elements it contains. Our work fully leverages the automatic collaboration capabilities of LLM-driven multi-agent systems. Experimental results on the RICO dataset show that our APD-agents achieve state-of-the-art performance.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Research Report (0.82)
- Workflow (0.49)
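A minimal orchestration sketch inspired by the APD-agents description above. The agent classes, their interfaces, and the stubbed LLM call are placeholders (assumptions), not the authors' implementation; the TemplateRetrievalAgent's few-shot retrieval step is omitted for brevity.

```python
def llm(prompt: str) -> str:
    """Stub standing in for an LLM call; returns canned text for the demo."""
    return "header, product list, checkout button"

class SemanticParserAgent:
    def run(self, description: str) -> dict:
        # Convert the free-text page description into structured element data.
        return {"elements": [e.strip() for e in llm(f"Parse page content: {description}").split(",")]}

class PrimaryLayoutAgent:
    def run(self, parsed: dict) -> list[dict]:
        # Coarse layout: stack the parsed elements top-to-bottom in normalized coordinates.
        n = len(parsed["elements"])
        return [{"name": e, "bbox": [0.0, i / n, 1.0, (i + 1) / n]}
                for i, e in enumerate(parsed["elements"])]

class RecursiveComponentAgent:
    def run(self, element: dict) -> dict:
        # Recursively expand an element into fine-grained sub-elements (toy rule here).
        element["children"] = [{"name": "product card"}] if element["name"] == "product list" else []
        return element

class OrchestratorAgent:
    def run(self, description: str) -> list[dict]:
        parsed = SemanticParserAgent().run(description)
        layout = PrimaryLayoutAgent().run(parsed)
        return [RecursiveComponentAgent().run(el) for el in layout]

print(OrchestratorAgent().run("A shopping page with a header, product list and checkout button."))
```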
Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation
Brioschi, Riccardo, Alekseev, Aleksandr, Nevali, Emanuele, Döner, Berkay, Malki, Omar El, Mitrevski, Blagoj, Kieliger, Leandro, Collier, Mark, Maksai, Andrii, Berent, Jesse, Musat, Claudiu, Kokiopoulou, Efi
Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet [43], DocLayNet [32] and SlidesVQA [35], demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above.
- Asia > Middle East > Jordan (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Middle East > Cyprus > Nicosia > Nicosia (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
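A toy sketch of the synthetic-training-sketch idea mentioned in the Sketch-to-Layout abstract above: rasterize ground-truth layout boxes into a rough, jittered line drawing that can stand in for a user sketch. The jitter scheme and canvas format here are assumptions, not the paper's actual generation method.

```python
import numpy as np

def boxes_to_sketch(boxes, size=64, jitter=0.02, seed=0):
    """boxes: list of (x0, y0, x1, y1) in [0, 1]; returns a (size, size) binary canvas."""
    rng = np.random.default_rng(seed)
    canvas = np.zeros((size, size), dtype=np.uint8)
    for box in boxes:
        # Perturb the box corners slightly so the "sketch" looks hand-drawn rather than exact.
        x0, y0, x1, y1 = (np.clip(np.asarray(box) + rng.normal(0, jitter, 4), 0, 1) * (size - 1)).astype(int)
        x0, x1 = sorted((x0, x1))
        y0, y1 = sorted((y0, y1))
        canvas[y0, x0:x1 + 1] = 1   # top edge
        canvas[y1, x0:x1 + 1] = 1   # bottom edge
        canvas[y0:y1 + 1, x0] = 1   # left edge
        canvas[y0:y1 + 1, x1] = 1   # right edge
    return canvas

# Example: two layout boxes become a crude rectangle sketch on a 64x64 grid.
sketch = boxes_to_sketch([(0.1, 0.1, 0.9, 0.3), (0.1, 0.4, 0.5, 0.9)])
print(sketch.sum(), sketch.shape)
```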
Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning
Ran, Xingjian, Li, Yixuan, Xu, Linning, Yu, Mulin, Dai, Bo
Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using the generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design a CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.
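A minimal sketch of the three-stage pipeline described in the DirectLayout abstract above (BEV layout, lift to 3D, refine). The LLM call is stubbed, and the prompts, height priors, and refinement rule are assumptions rather than the authors' method.

```python
def llm_bev_layout(description: str) -> list[dict]:
    """Stage 1 (stubbed): an LLM would return 2D BEV boxes with categories for the scene."""
    return [{"category": "bed", "bev_box": [0.1, 0.1, 0.6, 0.5]},
            {"category": "nightstand", "bev_box": [0.62, 0.1, 0.75, 0.25]}]

def lift_to_3d(bev_objects: list[dict], height_priors: dict) -> list[dict]:
    """Stage 2: attach a vertical extent to each BEV box using per-category height priors."""
    return [{**obj, "z_range": [0.0, height_priors.get(obj["category"], 1.0)]}
            for obj in bev_objects]

def refine(objects: list[dict]) -> list[dict]:
    """Stage 3 (toy): a real refiner would check collisions and adjust placements; here it is a no-op."""
    return objects

scene = refine(lift_to_3d(llm_bev_layout("a small bedroom with a bed and a nightstand"),
                          {"bed": 0.5, "nightstand": 0.6}))
print(scene)
```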